Code Re-construction
Volume Number: 2
Issue Number: 9
Column Tag: Nosy News
The Art of Code Re-construction
By Steve Jasik, Menlo Park, CA
In this special article by the famous industry guru Steve Jasik, author of
MacNosy, we get the inside scoop on the inner workings of Nosy and how to use it to best
advantage. Since Steve is famous for his brief documentation, these pearls of wisdom
will be most valuable, but be warned: this is not for the fainthearted, as Steve's
technical background in compiler design is clearly evident. -Ed.
Sniffing Around with MacNosy
As I sit here pecking away on my Mac, it is mid June. Nosy 2.099 has been
finalized and 2.1 is in progress. Most of the features I discuss in this article are in
2.099, and the rest will appear in 2.1 which will ship in July.
For starters, I'd like to try to answer some rhetorical questions:
What is Nosy, why use it, and what advantage does it offer over other methods of
obtaining information about programs?
Hypertext for programmers
Nosy is a fancy disassembler with extensive reference map (Hypertext for
programmers?) and symbol substitution facilities that verge on making it a
de-compiler. Nosy can be used to obtain information about any program or resource
that consists of 68000 native code. It does not handle Pcode, Mcode, etc. It can be used
to obtain accurate and informative listings of the ROM (64 or 128K), resources in the
system file, etc. It knows about the special structure of DRVR's (desk accessories) and
PDEF's. It can create '.asm' files that can be fed into the MDS assembler. Later this
year it will be modified so that its output will be compatible with the MPW (Macintosh
Programmer's Workshop) assembler.
Because it subdivides the program into procedures and data blocks, and
creates reference maps for those symbols, one is able in many cases to analyze what the
program is doing without having to execute it. When Nosy is used in conjunction
with a ".map" file produced by a Linker, or with the debug symbols left in the code by most
compilers, one can use it as a "source" level cross reference map facility for one's own
program. This can aid in tracking down bugs or making modifications to the program.
[Note: TML, LightSpeed and Consulair support debug code in their compilation that aids
Nosy. -Ed.]
You Didn't Read this Section in MacTutor!
In the same vein, one can use Nosy to locate and analyze copy protection code.
Modifications scheduled for V2.1 will let one disassemble CODE segments from programs
that are running in another partition under Switcher. This will allow one to analyze
copy protection code that is encrypted in the disk file copy of the program. The
fun thing about this mod is that no Switcher-friendly program will be able to
protect against this form of copy protection analysis, as it has no real way of
knowing that Nosy is looking at it. One will not need to make a debugger initially
active, so programs that check for the presence of one and then blow up will be
hoodwinked. Note that some programs check for the presence of a debugger by
inspecting the interrupt vector addresses (0 to $100) to ensure that they reference
ROM, and blow up if they don't.
Fig. 1 Call Graph
Given the proper order of things in starting up Switcher, Nosy and the program
to be analyzed, Nosy will temporarily put any debugger to "sleep" so that no other program
can detect its presence. After one has patched the code in the program that disables
the debugger, one can wake up the debugger, and use a combination of it and Nosy to
locate and disable the copy protection code. I am tempted to offer a prize to anyone who
can devise a method that cannot be cracked by Nosy. [Ouch! -Ed.]
More Coming
While I am in the process of teaching Nosy how to locate the CODE segments of
running programs, I thought it would also be nice if Nosy were fixed so that it could
disassemble arbitrarily large programs in 512K.
I have two motives for adding this facility at this time. The first is to allow the
analysis of programs that would not fit into memory. The second is that I will soon need
more space for tables in order to do global data flow analysis.
You can also use Nosy for code inspection of your own application. As Nosy grows
in size, I use Nosy on itself to see what code Lisa Pascal generates for a given
sequence of statements. When the code looks as if it can be trimmed by choosing
another sequence, I rewrite it, using the type-casting facility, etc.
Other reasons for using Nosy are to learn about 68000 code, and how Macintosh
applications are structured. The Nosy disk contains a number of interesting examples,
including a fast Shell sort which is extensively commented, and describes my assembly
language coding techniques.
Fig. 2 Flow Graph for above example
Compilers and Data Flow Analysis
Nosy makes use of a variety of techniques that are similar to those used in
compilers. For example, it has a variety of symbol tables, which use the address of the
symbol instead of the name as the primary lookup key. It also uses control and data
flow analysis to obtain information about the structure of the program. But first, some
terminology and review. Control flow analysis, which concerns itself with the flow of
execution in a program, is familiar to many programmers.
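To make the idea of an address-keyed symbol table concrete, here is a minimal Python sketch. It is not Nosy's actual data structure (Nosy is written in Lisa Pascal and its tables are not described here); the class name, method names and sample addresses are all invented for illustration.

```python
import bisect


class AddressSymbolTable:
    """Toy symbol table whose primary lookup key is the address (hypothetical design)."""

    def __init__(self):
        self._addrs = []   # sorted list of symbol addresses
        self._names = {}   # address -> symbol name

    def define(self, addr, name):
        """Record (or rename) the symbol located at addr."""
        if addr not in self._names:
            bisect.insort(self._addrs, addr)
        self._names[addr] = name

    def lookup(self, addr):
        """Exact lookup: the name at this address, or None."""
        return self._names.get(addr)

    def nearest(self, addr):
        """Return (name, offset) for the closest symbol at or below addr."""
        i = bisect.bisect_right(self._addrs, addr)
        if i == 0:
            return None
        base = self._addrs[i - 1]
        return self._names[base], addr - base


table = AddressSymbolTable()
table.define(0x0100, "proc_main")   # invented sample symbols
table.define(0x0180, "data410")
print(table.lookup(0x0180))         # data410
print(table.nearest(0x0184))        # ('data410', 4)
```

Keying on the address suits a disassembler because it starts from raw addresses in the code and only later learns (or invents) names for them.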
A Basic block consists of a sequence of contiguous statements such that if any
statement in the block is executed, all are. Basic blocks end prior to an active label or
after a jump instruction, which would be represented in a higher level language by an
if, case or goto statement. In a flow graph, basic blocks are the nodes (the circles) and
the edges are the possible flow paths of execution. (See fig. 1 for a diagram of a Basic
Block). The usual conventions are that flow is from top to bottom, with backward
branches represented as curved lines. The ground symbol represents the return
statement. In the example below we have a simple for loop with a side branch that the
programmer would have written as:
    for j := Low to upLim do
        if a[j] <> 0 then B[j] := B[j] / a[j];
    Return
In the process of analyzing the code, compilers will invent labels for basic
blocks that are the target of branch instructions, which is why I rewrote the code in the
diagram in fig. 2.
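The rule for ending basic blocks can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the instruction list is a hand translation of the for loop above, the opcode names are loosely modeled on 68000 mnemonics, and the labels test, next and done are invented in the same way a compiler would invent them.

```python
BRANCHES = {"BRA", "BEQ", "BGT", "RTS"}   # instructions that end a block


def basic_blocks(code):
    """Split a list of (label, opcode, target) tuples into basic blocks.

    A block ends just before an active label (one that some branch targets)
    or just after a branch or return instruction.
    """
    active = {target for _, op, target in code if op in BRANCHES and target}
    blocks, current = [], []
    for label, op, target in code:
        if label in active and current:   # block ends prior to an active label
            blocks.append(current)
            current = []
        current.append((label, op, target))
        if op in BRANCHES:                # block ends after a jump
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


# Hypothetical translation of the for loop in the text.
code = [
    (None,   "MOVE", None),      # j := Low
    ("test", "CMP",  None),      # compare j with upLim
    (None,   "BGT",  "done"),    # leave the loop when j > upLim
    (None,   "TST",  None),      # a[j] = 0 ?
    (None,   "BEQ",  "next"),    # side branch around the division
    (None,   "DIV",  None),      # B[j] := B[j] / a[j]
    ("next", "ADDQ", None),      # j := j + 1
    (None,   "BRA",  "test"),    # backward branch
    ("done", "RTS",  None),      # Return
]

blocks = basic_blocks(code)
print(len(blocks))                # 6
print([len(b) for b in blocks])   # [1, 2, 2, 1, 2, 1]
```

Each resulting block becomes one node of the flow graph; the branch targets and fall-through paths supply the edges.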
The process of collecting the flow of control information is a fairly simple one.
Analyzing it to find the loops in the program, etc., is slightly more complicated, but
straightforward. When one does this on an inter-procedural level the resulting graph
is called the "program call graph." Building a complete call graph in the presence of
procedures that accept other procedures as arguments is somewhat more difficult. For
compilers this is no real problem, as very few of them do inter-procedural
optimization. For Nosy it represents a real problem, as initially it does not know what
the arguments to user procedures are or their types. To date, my solution to the
problem has been to let the user supply the information manually with the Review data
and IsProc commands, which are discussed later in this article.
Fig. 3 Review Data command in window mode
Before we can discuss how Nosy can complete the call graph automatically, I need
to say a bit about data flow analysis, which is the process of collecting information
about the way variables are used in a program. In order to do effective global
optimization or de-compilation one has to collect both control and data flow
information. The exact sequence of events is that the control flow information for a
procedure and the data flow tables for the basic blocks in a procedure are built first.
Then the control flow information is used to "solve" the data flow equations globally.
The questions that an optimizing compiler will be able to answer with the data flow
information are:
What variables are not defined in the loop? The resulting set of expressions that
use the variables are candidates for code motion out of the loop, etc.
When does the value of a variable that is assigned to a register in a loop have to
be placed in memory prior to exit (is Live on exit) from a loop or more generally a
basic block? This is usually called the Live/Dead information.
The information that a compiler collects for each variable/array in a basic block
is called the use/def info, and consists of three bits for each variable: used in the block,
defined in the block, and used before definition. From this and the control flow
information one can develop the Live/Dead information.
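As a rough sketch of how the use/def bits and the control flow combine into Live/Dead information, the Python below computes the three sets for each block and then iterates the usual backward equations until nothing changes. The block names, the statement encoding and the small summation loop are assumptions made for illustration, and only scalar variables are handled; this is the textbook formulation, not a description of any particular compiler or of Nosy.

```python
def use_def(block):
    """block: list of (defined_var_or_None, [used_vars]) statements.

    Returns three sets mirroring the three bits per variable:
    used in the block, defined in the block, used before definition.
    """
    used, defined, used_first = set(), set(), set()
    for target, sources in block:
        for v in sources:
            used.add(v)
            if v not in defined:
                used_first.add(v)
        if target is not None:
            defined.add(target)
    return used, defined, used_first


def live_on_exit(blocks, succ):
    """Solve the backward data flow equations by iterating to a fixed point.

    live_in[b]  = used_first[b] | (live_out[b] - defined[b])
    live_out[b] = union of live_in[s] over the successors s of b
    """
    info = {b: use_def(stmts) for b, stmts in blocks.items()}
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            new_out = set()
            for s in succ[b]:
                new_out |= live_in[s]
            _, defined, used_first = info[b]
            new_in = used_first | (new_out - defined)
            if new_out != live_out[b] or new_in != live_in[b]:
                live_out[b], live_in[b] = new_out, new_in
                changed = True
    return live_out


# Hypothetical loop: s := 0; for j := low to up_lim do s := s + j; result := s
blocks = {
    "init": [("s", []), ("j", ["low"])],
    "test": [(None, ["j", "up_lim"])],
    "body": [("s", ["s", "j"]), ("j", ["j"])],
    "exit": [("result", ["s"])],
}
succ = {"init": ["test"], "test": ["body", "exit"], "body": ["test"], "exit": []}

live = live_on_exit(blocks, succ)
print(sorted(live["body"]))   # ['j', 's', 'up_lim']
print(sorted(live["exit"]))   # []
```

Here s is live on exit from the loop body, so if it were kept in a register inside the loop its value would still be needed afterwards.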
For most real languages, the problem of building the use/def tables is
compounded by aliasing of names due to the presence of array references, or pointer
variables which may point to the same location in memory at run-time. Most of you
are familiar with the problem, even if you have never given it a name. As a simple example of
aliasing, consider the following sequence where p is a pointer to an integer:
    p := addr(i);   { p is equivalenced or aliased to i }
    x := A[i];
    p^ := 2;
    y := A[i];      { would it be valid to change this to y := x; ?? }
Given the above code it would be imprudent for an optimizing compiler to replace
the second reference to A[i] with x, for example, since the value of i has been changed
directly in memory through p, and thus the programmer expects that x and y may have
different values!
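The same sequence can be acted out in Python to see the effect. Python has no addr or ^ operators, so this sketch models memory as a dictionary and the pointer as a key into it; the starting value of i and the contents of A are invented for the demonstration.

```python
# Model memory as a dictionary so that a "pointer" is just a key into it
# (an assumption for illustration only).
memory = {"i": 1}
A = [0, 10, 20, 30]

p = "i"                 # p := addr(i);  p is aliased to i
x = A[memory["i"]]      # x := A[i];     reads A[1]
memory[p] = 2           # p^ := 2;       changes i through the pointer
y = A[memory["i"]]      # y := A[i];     now reads A[2]

print(x, y)             # 10 20
```

Replacing the last assignment with y := x would have produced 10 instead of 20, which is exactly the rewrite the optimizer must avoid.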
Fig. 4 IsProc command
The primary effect of aliasing is to inhibit code optimization. Compilers for
languages such as C, with its weak typing of pointer variables, have a difficult time
producing good code in the presence of stores to pointer based variables.
Now back to Nosy. Another form of data flow problem concerns itself with
determining the type of a variable based on its usage. This is the problem that Nosy has
to solve if it is to automatically recognize procedural parameters and propagate this
information back to the calls so that all code blocks will be recognized as such. To do
this, Nosy must perform a symbolic simulation of the 68000 registers and stack as it
disassembles the instructions to determine the usage of variables in a basic block, and
it must keep enough control flow information around to propagate the type
information out of the procedure and back to the caller. I started to work on this
problem a few weeks ago and got sidetracked by my documentation writer's need for a
completed window mode so that we could document the visible parts of Nosy. With luck
I'll get back to it in time to show it off at the August MacWorld show.
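The following Python sketch gives a greatly simplified picture of the idea: follow values through registers inside a procedure to see which parameters are called through, then propagate that fact back to the call sites. Everything in it is an assumption made for illustration: the instruction triples, the "arg0" slot names, and the procedure and label names. It ignores the stack, control flow and most of the 68000, and it is not the algorithm planned for Nosy.

```python
def procedural_params(body):
    """Return the parameter slots that a procedure body calls through.

    body is a list of (opcode, source, destination) triples in an invented
    mini-notation: MOVE copies a parameter slot or register into a register,
    and JSR with a register source stands for an indirect call, JSR (An).
    """
    regs = {}        # register name -> parameter slot it currently holds
    called = set()
    for op, src, dst in body:
        if op == "MOVE":
            regs[dst] = src if src.startswith("arg") else regs.get(src)
        elif op == "JSR" and regs.get(src):
            called.add(regs[src])
    return called


def mark_code(call_sites, summaries):
    """Propagate back to the callers: labels passed in procedural slots are code.

    call_sites is a list of (callee, {slot: label}) pairs.
    """
    code = set()
    for callee, args in call_sites:
        for slot in summaries.get(callee, ()):
            if slot in args:
                code.add(args[slot])
    return code


# Hypothetical procedure that calls whatever was passed as its first parameter.
apply_body = [("MOVE", "arg0", "A0"), ("MOVE", "A0", "A1"), ("JSR", "A1", None)]
summaries = {"apply": procedural_params(apply_body)}

# Hypothetical call site that passes the address of the block data12.
call_sites = [("apply", {"arg0": "data12"})]

print(summaries["apply"])                 # {'arg0'}
print(mark_code(call_sites, summaries))   # {'data12'}
```

In this toy run the block data12 is reclassified as code without any explicit JSR to it.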
As a final note in this section, the topics discussed here are relatively old hat,
[! -Ed.] and one can find a more complete discussion of optimization and flow analysis
in chapter 10 of the "Red Dragon" book by Aho, Sethi & Ullman (Addison-Wesley). Its
correct title is "Compilers: Principles, Techniques, and Tools". It is a standard
textbook for Comp Sci undergraduates. I recommend its purchase to anyone who is
interested in understanding the structure of compilers.
Fig. 5 Case Jumps
Using Nosy to reformat Data
Nosy analyzes a program by walking the tree of procedures to build the program
call graph. At present, the treewalk is a one pass algorithm which recognizes the
procedure entry points by their references in JSR's. Procedures that are passed as
parameters (a common practice in Mac programs) are not recognized as code unless
there is an explicit reference to them in a JSR. As a result, not all
the code is exposed by Nosy. To get around this problem, I created the Review data
command, which lets one inspect the data blocks and reformat them as one wishes.
Now that I have converted the Review data command to window mode, one can see the
context in which a data block is referenced and do a more intelligent job of redefining the
format of the block, be it code or otherwise.
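A toy version of such a treewalk shows why procedures that are only passed as parameters stay hidden. In this Python sketch the program representation, the label names and the is_proc argument (standing in for entry points marked by hand) are all assumptions for illustration; it is not Nosy's actual algorithm.

```python
def treewalk(program, entry, is_proc=()):
    """Return the call graph reachable from entry via explicit JSR references.

    program maps a label to a list of (opcode, operand) pairs. Labels listed in
    is_proc are treated as extra entry points, as if the user had marked them.
    """
    graph = {}
    pending = [entry, *is_proc]
    while pending:
        proc = pending.pop()
        if proc in graph:
            continue                      # already walked
        callees = [arg for op, arg in program[proc] if op == "JSR"]
        graph[proc] = callees
        pending.extend(callees)
    return graph


# Hypothetical program: "callback" is only ever passed as a parameter (PEA),
# so no JSR refers to it.
program = {
    "main":     [("PEA", "callback"), ("JSR", "sorter"), ("RTS", None)],
    "sorter":   [("JSR", "swap"), ("RTS", None)],
    "swap":     [("RTS", None)],
    "callback": [("JSR", "swap"), ("RTS", None)],
}

print(sorted(treewalk(program, "main")))
# ['main', 'sorter', 'swap']
print(sorted(treewalk(program, "main", is_proc=["callback"])))
# ['callback', 'main', 'sorter', 'swap']
```

Only when callback is supplied as an extra entry point does the walk treat it as code.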
When the Review data command is selected, Nosy cleans up the desktop by closing
all the active windows except for the "-Notes-" window, which is a scratchpad for your
and Nosy's use. It then puts up a modeless dialog window for keyboard entry at the top
of the screen, a menu window below it, a "-Data Blk-" window which displays the data
block to be modified, and a "-Uses=" window with a display of a reference to the data
block, if any exists. In Review data mode, all keyboard input is directed to the command
window.
The illustration in figure 3 [RevDAT 1] shows the label data410 passed to
APPENDST, which we might guess to be a string concatenate routine. With this
information, it is easy to surmise that data410 is a Pascal string. If one wanted to find
out what APPENDST did, all one has to do is double click on the name and press ctl-D to
bring up a display of it. The only reason for reformatting all the data blocks in a
program would be to "Cut and Paste" the code in it into another application.
An alternative way of reformatting data blocks in window mode is to mark them
as procedure entry points with the "Is Proc" command. Data blocks marked as "IsProc"
will be treated as the entry point to a procedure on all subsequent treewalks. In some
programs, such as the PTCH resource in the System file, which contains the patches to
the ROM, it is easier to mark the data labels with the IsProc command than it is via
Review data. Figure 4 [Ptch IsProc] illustrates the use of this command.
One highlights a data label, and then selects the "Is Proc" command. As an aside,
note that most of the System Globals that begin with a lowercase j contain a pointer to
the address of a routine, so one may IsProc them with impunity.
Case Statement Analysis
Another reason that all the code may not be apparent to Nosy is the presence of case
statements that it cannot understand. The Case Jumps display (Fig. 5) lets
you determine which jumps are causing Nosy problems. The "Link Jump to Table"
command lets you specify information about the jump so that Nosy can process it
correctly the next time a treewalk is done. In figure 6 [Lnk JMP2] we see that the
offending case jump table consists entirely of negative entries, so Nosy has declined to
set the length of the table. The correct length, 7, can be entered in the dialog box in
the upper left hand corner of the display. The basic problem is that during the treewalk
when Nosy inspects the jump instruction, it does not know the length of the data block.
Because all the jump table entries are negative, it is not sure where the table ends.
When a jump table has at least one positive entry, Nosy can use that entry to determine
the length of the jump table, based on the assumption that the code for the cases
immediately follows the jump table.
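The length heuristic can be sketched as follows. The sketch assumes that entries are signed 16-bit offsets measured from the start of the table, and that the smallest positive offset therefore marks the first byte after the table; both are assumptions for illustration, as are the function name and sample values. Nosy's real table formats may differ.

```python
ENTRY_SIZE = 2   # bytes per table entry (assumed 16-bit offsets)


def jump_table_length(words):
    """Guess the number of entries in a jump table, or None if undecidable.

    words holds the signed 16-bit values found at the table address, possibly
    followed by unrelated code words. Case code is assumed to begin right
    after the table, so the smallest positive offset bounds the table.
    """
    limit = None                          # byte offset where case code begins
    for count, w in enumerate(words):
        if limit is not None and count * ENTRY_SIZE >= limit:
            return count                  # reached the first case body
        if w > 0 and (limit is None or w < limit):
            limit = w
    if limit is not None and len(words) * ENTRY_SIZE >= limit:
        return limit // ENTRY_SIZE
    return None                           # e.g. every entry was negative


# Four entries; the smallest positive offset (8 bytes) marks the end of the table.
print(jump_table_length([8, 14, -20, 8, 0x4E75]))              # 4

# Seven negative entries: nothing bounds the table, so the user must supply 7.
print(jump_table_length([-30, -24, -18, -12, -6, -40, -2]))    # None
```

The second call mirrors the situation in the figure: with no positive entry there is nothing to bound the table, so a human has to supply the length.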
The table format of this case statement is JUMPP, which means that the
jump table is to be interpreted as program relocatable offsets to procedure labels. The
JUMPC format defines a case or switch table to be a set of jumps to com_ labels (code
which is branched to by a JMP, BRA or Bxx instruction). The normal format for a case
table which consists of jumps to labels local to a procedure is JUMPL. It is sometimes
useful to break up a big procedure with a case table into smaller procedures by
changing the format of the case table from JUMPL to JUMPC.
Closing thoughts
In writing the above section I was tempted to go back, reinspect my algorithms,
and see if I couldn't automatically recognize the case discussed in the first example.
After all, it is easier to solve the problem once in the product than to have to
explain it many times to others. Unfortunately Nosy does need human help in
recognizing some patterns, and those of us who use it should be familiar with techniques
for reformatting data, etc. Some time ago a famous industrial engineer, Peter Denning
told an audience of Twist Drill manufacturers that what their customers wanted was not
drills, but holes. I remind myself that what you want is information without hassle,
and to that end I will continue to improve Nosy so the de-compilation process runs
more on automatic, and less on manual. [Please help support Steve's efforts in our
behalf by both BUYING Nosy and spreading the WORD, not the DISK. -Ed.]
Fig. 6 Correcting Table Jumps